Redesign json array streaming for datafusion #31
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Upstream:
apache#19924
PR Description
Summary
This PR introduces a high-performance streaming architecture for reading JSON array format files (
[{...}, {...}, ...]) in DataFusion. The new design processes arbitrarily large JSON array files with constant memory usage (~32MB) instead of loading the entire file into memory.Motivation
The previous implementation had critical limitations:
Solution Architecture
The new design uses a streaming character substitution approach that converts JSON array format to NDJSON on-the-fly:
Key Components
JsonArrayToNdjsonReader: A streamingRead+BufReadadapter that performs character-level transformation:[and trailing],to\nChannelReader: Bridges async-to-sync boundary by receivingByteschunks from a channel and implementingstd::io::ReadJsonArrayStream: Custom stream wrapper that holdsSpawnedTaskhandles for proper cancel-safetyMemory Budget (~32MB total)
API Changes
format_arrayoption tonewline_delimited(inverted semantics for clarity)newline_delimited: true(default) → NDJSON formatnewline_delimited: false→ JSON array formatNdJsonReadOptionstoJsonReadOptions(with deprecation alias)compression_levelfield fromJsonOptionsTesting
JsonArrayToNdjsonReaderincluding:JsonOpenerwith JSON array formatLimitations